PhD

The LaTeX sources of my Ph.D. thesis
git clone https://esimon.eu/repos/PhD.git
Log | Files | Refs | README | LICENSE

knowledge base.tex (25908B)


      1 \section{Knowledge Base}
      2 \label{sec:context:knowledge base}
      3 Our goal is to extract structured knowledge from text.
      4 In this section, we introduce the object we use to express this knowledge, namely the knowledge base.
      5 A knowledge base is a symbolic semantic representation of some piece of knowledge.
      6 It is defined by a set of concepts, named \emph{entities}, and by the relationships linking these entities together, named \emph{facts} or \emph{statements}.
      7 Formally, a knowledge base is constructed from a set of entities \(\entitySet\), a set of relations \(\relationSet\) and a set of facts \(\kbSet\subseteq\entitySet\times\relationSet\times\entitySet\).
      8 Note that these facts purpose to encode some kind of truth about the world.
      9 To illustrate, here are some examples from Wikidata~\parencite{wikidata}:
     10 \begin{align*}
     11 	\entitySet = \{ & \wdent{90}\text{(Paris)}, \wdent{7251}\text{(Alan Turing)}, \dotsc\} \\
     12 	\relationSet = \{ & \wdrel{1376}\text{(\textsl{capital of})}, \wdrel{19}\text{(\textsl{place of birth})}, \dotsc\} \\
     13 	\kbSet = \{
     14 		& \wdent{90}~\wdrel{1376}~\wdent{142}\text{ (Paris is the capital of France)}, \\
     15 		& \wdent{3897}~\wdrel{1376}~\wdent{916}\text{ (Luanda is the capital of Angola)}, \\
     16 		& \wdent{7251}~\wdrel{19}~\wdent{122744}\text{ (Alan Turing was born in Maida Vale)}, \\
     17 		& \wdent{164047}~\wdrel{19}~\wdent{23311}\text{ (Alexander Pope was born in London)}, \\
     18 		& \dotsc\} \hfill
     19 \end{align*}
     20 
     21 As indicated by the identifiers such as \wdent{7251}, knowledge bases link concepts together.
     22 An entity is a concept that may have several textual representations---surface forms---such as ``Alan Turing'' and ``Alan Mathison Turing.''
     23 Here, we showed the Wikidata identifier whose purpose is to identify concepts uniquely.
     24 For ease of reading, when there is no ambiguity between an entity and one of its surface forms, we simply write the surface form without the identifier of its associated concept.
     25 
     26 \begin{marginfigure}
     27 	\centering
     28 	\input{mainmatter/context/fact.tex}
     29 	\scaption[Structure of a knowledge base fact.]{
     30 		Structure of a knowledge base fact.
     31 		\label{fig:context:fact}
     32 	}
     33 \end{marginfigure}
     34 
     35 Given two entities \(e_1, e_2\in\entitySet\) and a relation \(r\in\relationSet\), we simply write \tripletHolds{e_1}{r}{e_2} as a shorthand notation for \((e_1, r, e_2)\in\kbSet\), meaning that \(r\) links \(e_1\) and \(e_2\) together.
     36 As illustrated by Figure~\ref{fig:context:fact}, \(e_1\) is called the \emph{head entity} of the fact or \emph{subject} of the relation \(r\).
     37 Similarly, \(e_2\) is called the \emph{tail entity} or \emph{object}, while \(r\) is called the \emph{relation}, \emph{property} or \emph{predicate}.%
     38 \sidenote{The term \emph{predicate} can either refer to the relation \(r\), or to the couple \((r, e_2)\), thus we will avoid using this terminology.}
     39 
     40 Thanks to this extremely rigid structure, knowledge bases are easier to process algorithmically.
     41 Querying some piece of information from a knowledge base is well defined and formalized.
     42 Query languages such as \textsc{sparql} ensure that information can be retrieved deterministically.
     43 \begin{marginparagraph}
     44 	Example of \textsc{sparql} query for all capital cities in Asia:
     45 
     46 	\smallskip
     47 
     48 	\input{mainmatter/context/sparql.tex}
     49 \end{marginparagraph}
     50 This is in contrast to natural language, where querying some knowledge from a piece of text needs to be performed using an \textsc{nlp} model, thus incurring some form of variance on the result.
     51 With this in mind, it is not surprising that several machine learning models rely on knowledge bases to remove a source of uncertainty from their system; this can be done in a variety of tasks such as question answering \parencite{kb_qa1, kb_qa2}, document retrieval \parencite{kb_document_retrieval} and logical reasoning \parencite{ntn}.
     52 
     53 Commonly used general knowledge bases include Freebase~\parencite{freebase}, \textsc{db}pedia~\parencite{dbpedia} and Wikidata~\parencite{wikidata}.
     54 There are also several domain-specific knowledge bases such as Wordnet~\parencite{wordnet} and GeneOntology~\parencite{geneontology}.
     55 Older works focus on Freebase---which is now discontinued---while newer ones focus on Wikidata and \textsc{db}pedia.
     56 These knowledge bases usually include more information than what was described above.
     57 For example, Wikidata includes statement qualifiers that may modify a statement, such as the fact ``Versailles capital of France'' qualified by ``end time: 5 October 1789.''
     58 For the sake of simplicity, we limit ourselves to triplets in \(\entitySet\times\relationSet\times\entitySet\).
     59 Further details on the specific knowledge bases can be found in Appendix~\ref{chap:datasets}.
     60 
     61 \subsection{Relation Algebra}
     62 \label{sec:context:relation algebra}
     63 \begin{marginparagraph}
     64 	The concept of relation algebra was theorized as a structure for logical systems.
     65 	Developed by several famous mathematicians such as Augustus De Morgan, Charles Peirce and Alfred Tarski, it can be used to express \textsc{zfc} set theory.
     66 	Here we only use relation algebra as a formal framework to express properties of binary relations.
     67 \end{marginparagraph}
     68 Relations linking two entities from the same set of entities \(\entitySet\) are called binary endorelations.
     69 A relation such as ``\textsl{capital of}'' is a subset of the cartesian square \(\entitySet^2\); it is a set of pairs of entities linked together by this relation.
     70 The set of all possible such relations exhibit a structure called a relation algebra \((2^{\entitySet^2}, \relationAnd, \relationOr, \bar{\ }, \relationZero, \relationOne, \relationComposition, \relationIdentity, \breve{\ })\).
     71 We use it as a formalized system of notation for relation properties.
     72 A relation algebra is defined from:
     73 \begin{itemize}
     74 	\item three special relations:
     75 		\begin{itemize}[nosep]
     76 			\item \(\relationZero\), the empty relation linking no entities together (\(\tripletHolds{e_1}{\relationZero}{e_2}\) is always false);
     77 			\item \(\relationOne\), the complete relation linking all entities together (\(\tripletHolds{e_1}{\relationOne}{e_2}\) is always true);
     78 			\item \(\relationIdentity\), the identity relation linking all entities to themselves (\(\tripletHolds{e_1}{\relationIdentity}{e_2}\) is true if and only if \(e_1=e_2\)).
     79 		\end{itemize}
     80 	\item two unary operators:
     81 		\begin{itemize}[nosep]
     82 			\item the complementary relation \(\bar{r}\) which links together entities not linked by \(r\);
     83 			\item the converse \(\breve{r}\) which reverses the direction of the relation such that \(\tripletHolds{e_1}{\breve{r}}{e_2}\) holds if and only if \(\tripletHolds{e_2}{r}{e_1}\) holds.
     84 		\end{itemize}
     85 	\item three binary operators (in order of lowest precedence, to highest precedence):
     86 		\begin{itemize}[nosep]
     87 			\item disjunction \(\tripletHolds{e_1}{(r_1\relationOr r_2)}{e_2}\), either \(r_1\) or \(r_2\) link \(e_1\) with \(e_2\);
     88 			\item conjunction \(\tripletHolds{e_1}{(r_1\relationAnd r_2)}{e_2}\), both \(r_1\) and \(r_2\) link \(e_1\) with \(e_2\);
     89 			\item composition \(\tripletHolds{e_1}{(r_1\relationComposition r_2)}{e_2}\), there exist \(e_3\in\entitySet\) such that both \(\tripletHolds{e_1}{r_1}{e_3}\) and \(\tripletHolds{e_3}{r_2}{e_2}\) hold.
     90 		\end{itemize}
     91 \end{itemize}
     92 \begin{marginparagraph}
     93 	Note that \(\relationComposition\) composes relations in the opposite order of the function composition \(\circ\).
     94 	Indeed while \(f\circ g\) means that \(g\) is applied first, then \(f\) is applied, ``\(\textsl{mother}\relationComposition\textsl{born in}\)'' means that ``\textsl{mother}'' is first applied to the entity, then ``\textsl{born in}'' is applied to the result.
     95 \end{marginparagraph}
     96 
     97 Thanks to this framework, we can express several properties on knowledge base relations since \(\relationSet\subseteq 2^{\entitySet^2}\).
     98 For example, the \emph{functional} property can be stated as \(\breve{r}\,\relationComposition\,r\,\relationOr\,\relationIdentity = \relationIdentity\).
     99 A relation \(r\) is functional when for all entities \(e_1\) there is at most one entity \(e_2\) such that \(\tripletHolds{e_1}{r}{e_2}\) holds.
    100 The relation ``\textsl{born in}'' is functional since all entities are either born at a single place or not born at all.
    101 Taking the above definition this means that for all cities \(c\) if we take all entities who were born in \(c\) (\(\breve{r}\,\color{black!30}\relationComposition\,r\,\relationOr\,\relationIdentity = \relationIdentity\)) and then (\(\color{black!30}\breve{r}\,\color{black}\relationComposition\,\color{black!30}r\,\relationOr\,\relationIdentity = \relationIdentity\)) look at where these entities were born (\(\color{black!30}\breve{r}\,\relationComposition\,\color{black}r\,\color{black!30}\relationOr\,\relationIdentity = \relationIdentity\)), we must be back to \(c\) and only c (\(\color{black!30}\breve{r}\,\relationComposition\,r\,\relationOr\,\relationIdentity \color{black}= \relationIdentity\)) or no such \(c\) shall exist (\(\color{black!30}\breve{r}\,\relationComposition\,r\,\color{black}\relationOr\,\relationIdentity \color{black!30}= \relationIdentity\)).
    102 We need to take the disjunction with \(\relationIdentity\) since some entities were not born anywhere, for example \(\tripletHolds{e}{(\breve{r}\relationComposition r)}{e}\) is false when \(r\) is ``\textsl{born in}'' and \(e\) is ``Mount Everest.''
    103 
    104 Other common properties of binary relations can be defined this way.
    105 One particular property of interest is the restriction of the domain and co-domain of relations.
    106 A lot of relations can only apply to a specific type of entity, such as locations or people.
    107 To express these properties, we use the notation \(\relationOne_X\subseteq \relationOne\) with \(X\subseteq\entitySet\) to refer to the complete relation restricted to entities in \(X\): \(\relationOne_X = \{\, (x_1, x_2) \mid x_1, x_2 \in X \,\}\).
    108 This allows us to define left-restriction (restriction of the domain) and right-restriction (restriction of the co-domain).
    109 Relevant properties are given in Table~\ref{tab:context:relation properties}.
    110 
    111 \begin{margintable}[0mm]
    112 	\centering
    113 	\input{mainmatter/context/relation properties.tex}
    114 	\scaption[Relation properties expressed in relation algebra.]{
    115 		Some fundamental relation properties expressed as conditions in relation algebra.
    116 		\label{tab:context:relation properties}
    117 	}
    118 \end{margintable}
    119 
    120 Some relation properties recurring in the literature are the cardinality constraints.
    121 They can be defined as combinations of the injective and functional properties:
    122 \begin{description}
    123 	\item[Many-to-Many] (\(N\to N\)\,) the relation is neither injective nor functional.\\
    124 		Examples: ``author of,'' ``language spoken,'' ``sibling of.''
    125 	\item[Many-to-One] (\(N\to 1\)) the relation is functional but it is not injective.\\
    126 		Examples: ``place of birth,'' ``country.''
    127 	\item[One-to-Many] (\(1\to N\)\,) the relation is injective but it is not functional.\\
    128 		Examples: ``contains administrative territorial entity,'' ``has part.''
    129 	\item[One-to-One] (\(1\to 1\)) the relation is both injective and functional.\\
    130 		Examples: ``capital,'' ``largest city,'' ``highest point.''
    131 \end{description}
    132 
    133 When a relation \(r\) is one-to-many, its converse \(\breve{r}\) is many-to-one.
    134 The usual way to design relations in knowledge bases is to use many-to-one relations, making one-to-many relations quite rare in practice.
    135 Since most systems handle relations in a symmetric fashion, this has little to no effect.
    136 
    137 Most of the examples given above are not strictly true.
    138 A person can be both registered as being born in Paris and in France.
    139 Some countries do not designate a single capital or share their highest point with a neighbor.
    140 However, defining these properties is helpful to evaluate the abilities of models to capture these kinds of relations.
    141 To handle such cases, these properties can be seen in a probabilistic way.%
    142 \sidenote{
    143 	Given empirical data, the propensity of a relation to be many-to-one can be measured with a conditional entropy \(\entropy(\rndm{e}_2\mid \rndm{e}_1, r)\).
    144 	An entropy close to zero means the relation tends to be many-to-one.
    145 }
    146 
    147 We use the notations from relation algebra to formalize assumptions made on the structure of knowledge bases.
    148 For example several models assume that \(\forall r_1, r_2\in\relationSet: r_1\relationAnd r_2 = \relationZero\), that is all pairs of entities are linked by at most one relation.
    149 A list of common assumptions is provided in Appendix~\ref{chap:assumptions}, it should prove useful from the Chapter~\ref{chap:relation extraction} onwards.
    150 For readers unfamiliar with relation algebra notations, we provide detailed explanation of complex formulae in the margins throughout this thesis.
    151 
    152 \subsection[Distributed Representation through Knowledge Base Completion]{Distributed Representation through Knowledge\\Base Completion}
    153 \label{sec:context:knowledge base completion}
    154 One problem with knowledge bases is that they are usually incomplete.
    155 However, given some information about an entity, it is usually possible to infer additional facts about this entity.
    156 This is called \emph{knowledge base completion}.
    157 Sometimes this inference is deterministic.
    158 For example, if two entities have the same two parents, we can infer that they are siblings.
    159 Quite often, this reasoning is probabilistic.
    160 For example, the head of state of a country usually lives in this country's capital; this probability can be further increased by facts indicating that previous heads of state died in the capital, etc.
    161 
    162 The task of knowledge base completion is essential for our work because of two reasons.
    163 First of all, it is the standard approach to obtain a distributed representation of knowledge base objects.
    164 Second, the models used to tackle this task are often reused as part of relation extraction systems; this is the case of all approaches presented in this section.
    165 
    166 We define two sub-tasks of knowledge base completion: \emph{relation prediction} and \emph{entity prediction}.%
    167 \sidenote[][-6.5mm]{In the literature, both of these tasks can be called ``link prediction'' and ``knowledge graph completion.''}
    168 In the relation prediction task, the goal is to predict the relation between two entities (\tripletHolds{e_1}{?}{e_2}), while entity prediction focuses on predicting a missing entity in a triplet (\tripletHolds{e_1}{r}{?} or \tripletHolds{?}{r}{e_2}).
    169 Historically, this is performed using symbolic approaches.
    170 \begin{marginparagraph}[-7mm]
    171 	Relation prediction is quite similar to our task of interest: relation extraction.
    172 	The main difference being that relation prediction is defined on knowledge bases, while relation extraction takes natural language inputs.
    173 	This parallel is exploited by the model presented in Chapter~\ref{chap:fitb}.
    174 \end{marginparagraph}
    175 For example, this task can be tackled using an inference engine relying on a human expert inputting logical rules such as:
    176 \begin{equation*}
    177 	\sfTripletHolds{e_1}{parent of}{e_2} \land \sfTripletHolds{e_1}{parent of}{e_3} \land e_2\neq e_3 \iff \sfTripletHolds{e_2}{sibling of}{e_3},
    178 \end{equation*}
    179 or using the relation algebra notation introduced in Section~\ref{sec:context:relation algebra}:
    180 \begin{equation*}
    181 	\widebreve{\textsl{parent of}} \,\relationComposition\, \textsl{parent of} \,\relationAnd\,\bar{\relationIdentity} = \textsl{sibling of}.
    182 \end{equation*}
    183 \begin{marginparagraph}[-12mm]
    184 	\tripletHolds{e_2}{\widebreve{\textsl{parent of}}}{e_1} means that \(e_1\) is a parent of \(e_2\).
    185 	Adding a composition to this, \tripletHolds{e_2}{\widebreve{\textsl{parent of}} \,\relationComposition\, \textsl{parent of}}{e_3} means that the aforementioned \(e_1\) has a child \(e_3\).
    186 	This child \(e_3\) could be the same as \(e_2\), this is why we take the conjunction with the complement of the identity relation \(\relationAnd\bar{\relationIdentity}\), thus obtaining the relation \textsl{sibling of}.
    187 \end{marginparagraph}
    188 However, listing all possible logical implications is not feasible.
    189 As with \textsc{nlp}, to tackle this problem, another approach is to leverage distributed representations.
    190 Some good early results were obtained by \textsc{rescal}, which we present in Section~\ref{sec:context:rescal}.
    191 But the problem started to gather a lot of interest in the deep learning community with TransE (Section~\ref{sec:context:transe}) which encodes relations as translation in the semantic space.
    192 This was followed by several other approaches that encoded relations as other kinds of geometric transformations.
    193 All the models presented in this section assume that the entities are embedded in a latent semantic space \(\symbb{R}^d\) with a matrix \(\mtrx{U}\in\symbb{R}^{\entitySet\times d}\) where \(d\) is an hyperparameter.
    194 
    195 \subsubsection{Selectional Preferences}
    196 \label{sec:context:selectional preferences}
    197 Selectional preferences is a simple formalism that purposes to encode each relation with two linear maps assessing the predisposition of an entity to appear as the head or tail of a relation in a true fact.
    198 This can be done using an energy formalism, where the energy of a fact is defined as:
    199 \begin{equation}
    200 	\psi_\textsc{sp}(e_1, r, e_2) = \vctr{u}_{e_1}\transpose \vctr{a}_r + \vctr{u}_{e_2}\transpose \vctr{b}_r
    201 \end{equation}
    202 with \(\mtrx{A}, \mtrx{B}\in\symbb{R}^{\relationSet\times d}\) two matrices encoding the preferences of each relation for certain entities.
    203 This energy function can then be used to define the probability that a fact holds using a softmax:
    204 \begin{equation}
    205 	P(e_1, r, e_2) \propto \exp \psi_\textsc{sp}(e_1, r, e_2),
    206 	\label{eq:context:sp softmax}
    207 \end{equation}
    208 this is sufficient for entity and relation predictions as we can usually compute the partition function over the set of all entities or relations.
    209 If this is not feasible, a technique such as \textsc{nce} (Section~\ref{sec:context:nce}) or negative sampling (Section~\ref{sec:context:negative sampling}) can be used to approximate Equation~\ref{eq:context:sp softmax}.
    210 Still, selectional preferences do not encode the interaction of the head and tail entities.
    211 As such it is quite weak for entity prediction, thus more expressive models are needed.
    212 
    213 \subsubsection{\textsc{rescal}}
    214 \label{sec:context:rescal}
    215 \textsc{rescal}~\parencitex{rescal} purposes to model relations by a bilinear form \(\entitySet\times\entitySet\mapsto\symbb{R}\) in the semantic space of entities.
    216 In other words, each relation \(r\in\relationSet\) is represented by a matrix \(\mtrx{C}_r\in\symbb{R}^{d\times d}\) with the training algorithm seeking to enforce the following property:
    217 \begin{equation*}
    218 	\vctr{u}_{e_1}\transpose \mtrx{C}_r \vctr{u}_{e_2} =
    219 	\begin{cases}
    220 		1 & \quad \text{if \tripletHolds{e_1}{r}{e_2} holds} \\
    221 		0 & \quad \text{otherwise.}
    222 	\end{cases}
    223 \end{equation*}
    224 This can be seen as trying to factorize the tensor of facts \(\tnsr{X}\) as \(\mtrx{U}\tnsr{C}\mtrx{U}\transpose\), where \(\tnsr{X}\in\{0,1\}^{\entitySet\times\relationSet\times\entitySet}\) with \(x_{e_1re_2}=1\) if \tripletHolds{e_1}{r}{e_2} holds and \(x_{e_1re_2}=0\) otherwise.
    225 The parameters of the models \(\mtrx{U}\) and \(\tnsr{C}\) are trained using an alternating least-squares approach, minimizing a regularized reconstruction loss:
    226 \begin{equation}
    227 	\symcal{L}_\textsc{rescal}(\tnsr{X}; \mtrx{U}, \tnsr{C}) = \frac{1}{2} \sum_{\substack{e_1,e_2\in\entitySet\\r\in\relationSet}} (x_{e_1re_2} - \vctr{u}_{e_1}\transpose \mtrx{C}_r \vctr{u}_{e_2})^2 + \frac{1}{2} \lambda ( \|\mtrx{U}\|_F^2 + \sum_{r\in\relationSet} \|\mtrx{X}_r\|_F^2 )
    228 \end{equation}
    229 
    230 Using bilinear forms allows \textsc{rescal} to capture entities interactions for each relation in a simple manner.
    231 However, the number of parameters to estimate grows quadratically with respect to the dimension of the semantic space \(d\).
    232 This can be prohibitive as a large \(d\) is needed to ensure accurate modeling of the entities.
    233 
    234 \subsubsection{TransE}
    235 \label{sec:context:transe}
    236 To find a balance between the number of parameters and the expressiveness of the model, geometric approaches were developed starting with TransE~\parencitex{transe}.
    237 TransE proposes to leverage the regularity exhibited by Figure~\ref{fig:context:word2vec pca} to embed both entities and relations in the same vector space.
    238 Formally, its assumption is that relations can be represented as translations between entities' embeddings.
    239 In addition to representing each entity \(e\) by an embedding \(\vctr{u}_e\in\symbb{R}^d\), each relation \(r\) is also embedded as a translation in the same space as \(\vctr{v}_r\in\symbb{R}^d\).
    240 The idea being that if \tripletHolds{e_1}{r}{e_2} holds then \(\vctr{u}_{e_1} + \vctr{v}_r \approx \vctr{u}_{e_2}\).
    241 The authors argue that translations can represent hierarchical data by drawing a parallel with the embedding of a tree in an Euclidean plane---that is the usual representation of a tree as drawn on paper.
    242 As long as the distance between two levels in the tree is large enough, the children of a node are close together; this not only allows for the representation of one-to-many relations ``child'' but also for the many-to-many, symmetric and transitive relation ``sibling'' as the null translation.
    243 
    244 In order to enforce the translation property, a margin-based loss is used to train an energy-based model.
    245 The energy of true triplets drawn from the knowledge base is minimized, while negative triplets are sampled and have their energy maximized up to a certain margin.
    246 Given a positive triplet \((e_1, r, e_2)\) and a negative triplet \((e_1', r, e_2')\), the TransE loss can be expressed as:
    247 \begin{equation}
    248 	\symcal{L}_\textsc{te}(e_1, r, e_2, e_1', e_2') = \max\left(0, \gamma + \Delta(\vctr{u}_{e_1} + \vctr{v}_r, \vctr{u}_{e_2}) - \Delta(\vctr{u}_{e_1'} + \vctr{v}_r, \vctr{u}_{e_2'})\right),
    249 	\label{eq:context:transe loss}
    250 \end{equation}
    251 where \(\Delta\) is a distance function such as the squared Euclidean distance \(\Delta(\vctr{u}_{e_1} + \vctr{v}_r, \vctr{u}_{e_2}) = \|\vctr{u}_{e_1} + \vctr{v}_r - \vctr{u}_{e_2} \|_2^2\).
    252 The negative triplets \((e_1', r, e_2')\) are sampled by replacing one of the two entities of \((e_1, r, e_2)\) by a random one which is sampled uniformly over all possible entities:
    253 \begin{equation*}
    254 \begin{split}
    255 	N(e_1, e_2) = &
    256 		\begin{cases}
    257 			(e_1, e') & \text{ with probability } 50\% \\
    258 			(e', e_2) & \text{ with probability } 50\% \\
    259 		\end{cases} \\
    260 		& \text{with } e' \sim \uniformDistribution(\entitySet).
    261 \end{split}
    262 \end{equation*}
    263 
    264 Since \(d\) is a distance, when the loss \(\symcal{L}_\textsc{te}\) is perfectly minimized, the positive part \(+\Delta(\vctr{u}_{e_1} + \vctr{v}_r, \vctr{u}_{e_2})\) is 0.
    265 This means that the negative part \(-\Delta(\vctr{u}_{e_1'} + \vctr{v}_r, \vctr{u}_{e_2'})\) contributes to the loss only when it is smaller than the margin \(\gamma\).
    266 Since this criterion depends on the distance between entities, it can easily be optimized by increasing the entity embeddings norms.
    267 To avoid this degenerate solution, the entity embeddings are renormalized at each training step.
    268 The training loop and initialization procedure are detailed in Algorithm~\ref{alg:context:transe}.
    269 Parameters \(\mtrx{U}\) and \(\mtrx{V}\) are optimized by stochastic gradient descent with early-stopping based on validation performance.
    270 \begin{marginalgorithm}[-1cm]
    271 	\input{mainmatter/context/transe.tex}
    272 	\scaption[The TransE training algorithm.]{
    273 		The TransE training algorithm.
    274 		The relations are initialized randomly on the sphere but are free to drift away afterward, while entities are renormalized at each iteration.
    275 		The loop updates parameters \(\mtrx{U}\) and \(\mtrx{V}\) using gradient descent and is stopped based on validation score.
    276 		The gradient of \(\symcal{L}_\textsc{te}\) is computed from Equation~\ref{eq:context:transe loss}.
    277 		\label{alg:context:transe}
    278 	}
    279 \end{marginalgorithm}
    280 
    281 \paragraph{Evaluation}
    282 The quality of the embeddings can be evaluated by measuring the accuracy of entity prediction based on them.
    283 Given a true triplet \((e_1, r, e_2)\in\kbSet\), the energy \(\Delta(\vctr{u}_{e'} + \vctr{v}_r, \vctr{u}_{e_2})\) is computed for all possible entities \(e'\in\entitySet\).
    284 The entity minimizing the energy is predicted as completing the triplet.
    285 The same procedure is then applied on \(e_2\).
    286 The correct entity minimizes the energy quite rarely, therefore in order to have a more informative score \textcite{transe} reports the mean rank of the correct entity among all the entities ranked by the energy of their associated triplets.
    287 For reference, on WordNet, the mean rank of the correct entity is 263 among 40\,943 entities.
    288 
    289 When expanding the expression \(\Delta(\vctr{u}_{e_1} + \vctr{v}_r, \vctr{u}_{e_2})\) where \(d\) is the Euclidean distance, the main term ends up being \(\vctr{u}_{e_1}\transpose\vctr{u}_{e_2} + \vctr{v}_r\transpose (\vctr{u}_{e_2} - \vctr{u}_{e_1})\).
    290 As such, TransE captures all 2-way interactions between \(e_1\), \(r\) and \(e_2\).
    291 However, this means that 3-way interactions are not captured, this is however standard in information extraction.
    292 Furthermore, TransE is unable to model several symmetric relations (when \(r=\breve{r}\)).
    293 To solve these problems, other geometric transformations were proposed to improve TransE expressiveness, such as first projecting entities on a hyperplane (TransH, \cite{transh}) or having the entities and relations live in different spaces (TransR, \cite{transr}).
    294 Finally, all the methods mentioned in this section are not only useful for entity and relation predictions, but also as methods to obtain distributed representations of knowledge bases entities and relations.
    295 The matrices \(\mtrx{U}\) and \(\mtrx{V}\) learned by TransE can subsequently be used for other tasks involving knowledge bases, in the same way that transfer learning is used to obtain distributed representations of text using language models (Section~\ref{sec:context:transfer learning}).